This dataset is publicly available for research.
The details are described in [Cortez et al., 2009].
Please include this citation if you plan to use this database:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
In this report, a dataset of 1599 red wine instances, each described by 12 variables,
is explored. The 12 variables are:
1. fixed acidity (tartaric acid - g / dm^3)
2. volatile acidity (acetic acid - g / dm^3)
3. citric acid (g / dm^3)
4. residual sugar (g / dm^3)
5. chlorides (sodium chloride - g / dm^3)
6. free sulfur dioxide (mg / dm^3)
7. total sulfur dioxide (mg / dm^3)
8. density (g / cm^3)
9. pH
10. sulphates (potassium sulphate - g / dm^3)
11. alcohol (% by volume)
12. quality (score between 0 and 10)
## [1] 1599 13
## 'data.frame': 1599 obs. of 13 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ quality.class : chr "F" "F" "F" "E" ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality quality.class
## Min. :3.000 Length:1599
## 1st Qu.:5.000 Class :character
## Median :6.000 Mode :character
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
The dataset has 1599 observations with 12 descriptive variables.
A statistical summary of all variables is shown above. It will be used
as a reference and to understand the variables better.
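The shape, structure, and summary output above is what R's `dim()`, `str()`, and `summary()` print. A minimal sketch of those calls follows, assuming the data frame is named `red_wine` (the name that appears in the correlation output later in this report). It uses only the first ten rows listed in the `str()` output and a subset of the columns, not the full 1599-row file, so its numbers will differ from the ones above.

```r
# First ten rows of selected columns, copied from the str() output above.
# This is a small sample for illustration, not the full dataset.
red_wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  residual.sugar   = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

dim(red_wine)      # number of rows and columns
str(red_wine)      # type and first values of each column
summary(red_wine)  # min, quartiles, median, mean, and max per column
```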
The histograms above show the following:
Fixed Acidity has a roughly normal distribution with a median of 7.90 and a mean of 8.32.
Volatile Acidity has a roughly normal distribution with a median of 0.5200 and a mean of 0.5278.
The x-axis ranges differ between the two histograms because the two acids occur in different quantities.
The histogram above shows that Citric Acid has two tall bins, at 0 and at 0.48.
Citric Acid has a median of 0.260 and a mean of 0.271.
The histogram above shows that Residual Sugar has a long right tail due to outliers.
Residual Sugar has a median of 2.200 and a mean of 2.539.
In the Chlorides histogram above, we zoomed in to better understand the graph.
The Chlorides histogram also has a long tail due to outliers.
Chlorides has a median of 0.07900 and a mean of 0.08747.
In the histograms above, we started with the Free Sulfur Dioxide histogram,
then plotted the Total Sulfur Dioxide histogram, and then combined the two,
since Free Sulfur Dioxide is part of Total Sulfur Dioxide.
Free Sulfur Dioxide values are concentrated at the low end of the range.
Free Sulfur Dioxide has a median of 14.00 and a mean of 15.87.
Total Sulfur Dioxide has a median of 38.00 and a mean of 46.47.
In the Density histogram above, Density has a normal distribution
with a median of 0.9968 and a mean of 0.9967.
In the pH histogram above, pH has a normal distribution
with a median of 3.310 and a mean of 3.311.
The histogram above shows that Sulphates has a long tail due to outliers.
Sulphates has a median of 0.6200 and a mean of 0.6581.
The histogram above shows that Alcohol has a positively skewed distribution.
Alcohol has a median of 10.20 and a mean of 10.42.
The histogram above shows that Quality has a roughly normal distribution.
Quality has a median of 6.000 and a mean of 5.636.
Most wines score a 5 or 6 in quality.
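One rough way to cross-check the distribution shapes described above is to compare each variable's mean with its median: a mean well above the median is consistent with a long right tail. The sketch below uses only the median and mean values reported in the summary output of this report. The relative gap it computes is an illustrative measure chosen for this example, not something taken from the original analysis.

```r
# Median and mean values copied from the summary() output in this report.
stats <- data.frame(
  variable = c("fixed.acidity", "volatile.acidity", "citric.acid",
               "residual.sugar", "chlorides", "free.sulfur.dioxide",
               "total.sulfur.dioxide", "density", "pH", "sulphates", "alcohol"),
  median = c(7.90, 0.5200, 0.260, 2.200, 0.07900, 14.00, 38.00,
             0.9968, 3.310, 0.6200, 10.20),
  mean   = c(8.32, 0.5278, 0.271, 2.539, 0.08747, 15.87, 46.47,
             0.9967, 3.311, 0.6581, 10.42)
)

# Gap between mean and median, as a percentage of the median.
stats$gap.pct <- round(100 * (stats$mean - stats$median) / stats$median, 2)

# Largest gaps first.
stats[order(-abs(stats$gap.pct)), ]
```

The gaps for Density and pH are close to zero, while the sulfur dioxide variables, Residual Sugar, and Chlorides show the largest positive gaps.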
There are 1599 red wine instances with 12 features:
(fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides,
free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol,
quality). The only ordered variables are quality and the quality.class derived from it.
(worst score) ----> (best score)
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
K, J, I, H, G, F, E, D, C, B, A
Other observations:
* All wine instances score between 3 and 8.
* Both Residual Sugar and Chlorides have long tails.
* Minimum Alcohol % is 8.40 and maximum Alcohol % is 14.90.
The main features of interest in the dataset are fixed acidity and quality.
I’d like to see how the other variables affect these two features.
All the other variables could help. I expect density and pH to be related to
fixed acidity, and fixed acidity, volatile acidity, and alcohol to have the
most effect on quality.
No.
Residual Sugar and Chlorides had long tails when I graphed them.
I zoomed in when plotting the Chlorides histogram to better understand the graph,
because it had a long tail.
I applied a transformation to Residual Sugar because it is heavily right skewed,
following my reviewer’s advice on this matter.
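The report does not say which transformation was applied to Residual Sugar. As an illustration only, the sketch below assumes a base-10 log transform, a common choice for right-skewed positive data, and applies it to the ten Residual Sugar values listed in the `str()` output. The helper `skew.gap` is a hypothetical name introduced here: it measures how far the mean sits from the median in units of standard deviation.

```r
# Residual sugar of the first ten wines, copied from the str() output.
residual.sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# Assumption: log10 transform; the report does not name the transform used.
log.sugar <- log10(residual.sugar)

# Hypothetical helper: distance between mean and median, scaled by the
# standard deviation. Values near zero suggest a more symmetric sample.
skew.gap <- function(x) (mean(x) - median(x)) / sd(x)

c(raw = skew.gap(residual.sugar), log10 = skew.gap(log.sugar))
```

On this small sample the gap shrinks after the transform, which is the effect a log scale is meant to have on a long right tail.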
I added a new variable called ‘quality.class’ by converting the numeric quality
scale (0 to 10) to letter classes (K to A, where A is the best), to make quality
easier to understand and visualize.
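The score-to-letter conversion can be sketched in a few lines of base R. The function name `to_quality_class` is hypothetical, and the report does not show how the conversion was actually coded. The mapping itself follows the scale listed earlier (0 is K, 10 is A), and the input is the quality column of the first ten wines from the `str()` output.

```r
# Quality scores of the first ten wines, copied from the str() output.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Hypothetical helper: 0 -> "K", 1 -> "J", ..., 10 -> "A".
to_quality_class <- function(q) {
  stopifnot(all(q >= 0 & q <= 10))
  LETTERS[11 - q]
}

quality.class <- to_quality_class(quality)
quality.class

# Optional: an ordered factor keeps the worst-to-best ordering (K < ... < A).
quality.class.ordered <- factor(quality.class,
                                levels = rev(LETTERS[1:11]),
                                ordered = TRUE)
```

The first four results, "F" "F" "F" "E", agree with the `quality.class` values shown in the `str()` output.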
This is an overview of the bivariate plots. It is used to choose which
graphs to draw and to understand the relationships between variables.
##
## Pearson's product-moment correlation
##
## data: red_wine$fixed.acidity and red_wine$citric.acid
## t = 36.234, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6438839 0.6977493
## sample estimates:
## cor
## 0.6717034
We see a positive correlation between Fixed Acidity and Citric Acid.
The Pearson’s product-moment correlation is 0.6717034.
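The other numbers in the test output follow from the correlation estimate and the degrees of freedom. As a worked check, the sketch below recomputes the t statistic and the 95 percent confidence interval for Fixed Acidity and Citric Acid from the reported values alone, using the standard t formula for a Pearson correlation and the Fisher z-transform for the interval.

```r
r  <- 0.6717034  # sample correlation reported above
df <- 1597       # degrees of freedom reported above (n - 2)
n  <- df + 2     # number of observations

# t statistic for the null hypothesis that the true correlation is 0.
t.stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z-transform.
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(t.stat, 3)
round(ci, 4)
```

Both results agree with the printed output (t = 36.234, interval from 0.6438839 to 0.6977493) up to rounding.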
##
## Pearson's product-moment correlation
##
## data: red_wine$fixed.acidity and red_wine$density
## t = 35.877, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6399847 0.6943302
## sample estimates:
## cor
## 0.6680473
We see a positive correlation between Fixed Acidity and Density.
The Pearson’s product-moment correlation is 0.6680473.
##
## Pearson's product-moment correlation
##
## data: red_wine$fixed.acidity and red_wine$pH
## t = -37.366, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7082857 -0.6559174
## sample estimates:
## cor
## -0.6829782
We see a negative correlation between Fixed Acidity and pH.
The Pearson’s product-moment correlation is -0.6829782.
##
## Pearson's product-moment correlation
##
## data: red_wine$fixed.acidity and red_wine$alcohol
## t = -2.4691, df = 1597, p-value = 0.01365
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.11035580 -0.01268548
## sample estimates:
## cor
## -0.06166827
We see almost no correlation between Fixed Acidity and Alcohol.
The Pearson’s product-moment correlation is -0.06166827.
##
## Pearson's product-moment correlation
##
## data: red_wine$volatile.acidity and red_wine$citric.acid
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5856550 -0.5174902
## sample estimates:
## cor
## -0.5524957
On the other hand, we see a negative correlation between Volatile Acidity and
Citric Acid. The Pearson’s product-moment correlation is -0.5524957.
##
## Pearson's product-moment correlation
##
## data: red_wine$quality and red_wine$alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
We see a positive correlation between Quality and Alcohol.
The Pearson’s product-moment correlation is 0.4761663.
##
## Pearson's product-moment correlation
##
## data: red_wine$quality and red_wine$volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
We see some negative correlation between Quality and Volatile Acidity.
The Pearson’s product-moment correlation is -0.3905578.
##
## Pearson's product-moment correlation
##
## data: red_wine$quality and red_wine$pH
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.106451268 -0.008734972
## sample estimates:
## cor
## -0.05773139
We see almost no correlation between Quality and pH.
The Pearson’s product-moment correlation is -0.05773139.
##
## Pearson's product-moment correlation
##
## data: red_wine$quality and red_wine$fixed.acidity
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07548957 0.17202667
## sample estimates:
## cor
## 0.1240516
We see only a weak positive correlation between Quality and Fixed Acidity.
The Pearson’s product-moment correlation is 0.1240516.
We saw positive, negative, and near-zero correlations between the features of
interest and other features.
We saw positive correlations between:
Fixed Acidity and Citric Acid
Fixed Acidity and Density
Quality and Alcohol
We saw negative correlations between:
Fixed Acidity and pH
Volatile Acidity and Citric Acid
Quality and Volatile Acidity
We saw almost no correlation between:
Fixed Acidity and Alcohol
Quality and pH
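The groupings above can be tabulated directly from the reported coefficients. The sketch below collects the nine Pearson correlations printed in this section and labels each by direction and strength. The cut-offs of 0.3 and 0.6 are assumptions chosen for illustration and are not taken from the report.

```r
# Pearson correlations copied from the cor.test() output in this section.
cors <- c(
  "fixed.acidity ~ citric.acid"    =  0.6717034,
  "fixed.acidity ~ density"        =  0.6680473,
  "fixed.acidity ~ pH"             = -0.6829782,
  "fixed.acidity ~ alcohol"        = -0.06166827,
  "volatile.acidity ~ citric.acid" = -0.5524957,
  "quality ~ alcohol"              =  0.4761663,
  "quality ~ volatile.acidity"     = -0.3905578,
  "quality ~ pH"                   = -0.05773139,
  "quality ~ fixed.acidity"        =  0.1240516
)

# Assumption: illustrative strength cut-offs at |r| = 0.3 and |r| = 0.6.
strength <- cut(abs(cors), breaks = c(0, 0.3, 0.6, 1),
                labels = c("weak", "moderate", "strong"))

overview <- data.frame(
  pair      = names(cors),
  cor       = unname(cors),
  direction = ifelse(cors > 0, "positive", "negative"),
  strength  = strength,
  row.names = NULL
)

# Strongest relationships first.
overview[order(-abs(overview$cor)), ]
```

With these cut-offs, the three pairs involving Fixed Acidity with Citric Acid, Density, and pH come out as the strongest relationships.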
Yes, we see a negative correlation between Volatile Acidity and Citric Acid. The Pearson’s product-moment correlation is -0.5524957.
That was interesting.
There were three strong relationships in the dataset:
Fixed Acidity and Citric Acid (positive relation - Pearson’s correlation = 0.67)
Fixed Acidity and Density (positive relation - Pearson’s correlation = 0.67)
Fixed Acidity and pH (negative relation - Pearson’s correlation = -0.68)
We see some positive correlation, with points concentrated at low values of x and y. We also see that wine quality improves as the x and y values increase.
We see some negative correlation between Fixed Acidity and pH.
We also see that quality is scattered all over the graph.
We see a positive correlation between Fixed Acidity and Density.
The higher quality wines mostly sit lower in the plot than the lower quality wines.
We see some positive correlation between Fixed Acidity/Volatile Acidity and
Citric Acid. We can also see some overlap at the lower x and y values, and wine quality improves as the x and y values increase.
The lower the quality, the smaller the quantile boxes.
There was no relation between Fixed Acidity and Alcohol. However, when we
plotted the ratio of Fixed Acidity to Volatile Acidity against Alcohol,
we saw a positive correlation.
Yes, the ratio of Fixed Acidity to Volatile Acidity gives some interesting
results when plotted against other features.
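The ratio feature can be sketched as a simple element-wise division. The variable name `acidity.ratio` is hypothetical, and the example uses only the first ten wines from the `str()` output, so the correlation it prints is not the one observed on the full dataset.

```r
# First ten wines, copied from the str() output earlier in this report.
fixed.acidity    <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
volatile.acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
alcohol          <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)

# Hypothetical name for the derived feature: fixed over volatile acidity.
acidity.ratio <- fixed.acidity / volatile.acidity
round(acidity.ratio, 2)

# Pearson correlation between the ratio and alcohol on this small sample.
cor(acidity.ratio, alcohol)
```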
## [1] "Fixed Acidity (g/dm^3) Statistics"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
## [1] "Citric Acid (g/dm^3) Statistics"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
## [1] "Quality Statistics"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
##
## Pearson's product-moment correlation
##
## data: red_wine$fixed.acidity and red_wine$citric.acid
## t = 36.234, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6438839 0.6977493
## sample estimates:
## cor
## 0.6717034
We see a positive correlation between Fixed Acidity and Citric Acid.
The Pearson’s product-moment correlation is 0.6717034.
Most of the higher quality wines (C-D) are above the regression line while
the lower quality wines (G-H) are below it. However, the wines in the middle
classes (E-F) fall both above and below the regression line. The points are scattered evenly
through the graph. I chose this graph because I wanted to see how quality
is distributed in it, and I was not surprised by the result.
I dropped the top 1% of the Fixed Acidity data because there
were some gaps in the values.
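Dropping the top 1% of a variable amounts to cutting at its 99th percentile. The report does not show how this was done, so the following is a minimal sketch using `quantile()` on the ten Fixed Acidity values from the `str()` output. On the full dataset the cut-off would of course be different.

```r
# Fixed acidity of the first ten wines, copied from the str() output.
fixed.acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)

# 99th percentile (R's default interpolation between order statistics).
cutoff <- quantile(fixed.acidity, 0.99)

# Keep only the values at or below the cut-off.
trimmed <- fixed.acidity[fixed.acidity <= cutoff]

cutoff
length(trimmed)
```

On this sample the single largest value, 11.2, lies above the cut-off and is removed.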
## [1] "Fixed Acidity (g/dm^3) Statistics"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
## [1] "Density (g/cm^3) Statistics"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
## [1] "Quality Statistics"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
##
## Pearson's product-moment correlation
##
## data: red_wine$fixed.acidity and red_wine$density
## t = 35.877, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6399847 0.6943302
## sample estimates:
## cor
## 0.6680473
We see a positive correlation between Fixed Acidity and Density.
The Pearson’s product-moment correlation is 0.6680473.
Most of the higher quality wines (C-D) are below the regression line while
the lower quality wines (G-H) lie along it. However, the wines in the middle
classes (E-F) fall both above and below the regression line, with (F) mostly above it.
The points are scattered evenly through the graph. I chose this graph because
I wanted to see how quality is distributed in it, and I was not surprised
by the result. I dropped the top 1% of the Fixed
Acidity data because there were some gaps in the values.
## [1] "Fixed Acidity (g/dm^3) Statistics"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
## [1] "Alcohol (% by volume) Statistics"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
## [1] "Quality Statistics"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
##
## Pearson's product-moment correlation
##
## data: red_wine$fixed.acidity and red_wine$alcohol
## t = -2.4691, df = 1597, p-value = 0.01365
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.11035580 -0.01268548
## sample estimates:
## cor
## -0.06166827
We see almost no correlation between Fixed Acidity and Alcohol.
The Pearson’s product-moment correlation is -0.06166827.
Most of the higher quality wines (C-D) are above the regression line while
the lower quality wines (F-H) are below it. Quality class (E), on the
other hand, is scattered everywhere. The points are scattered evenly through the graph.
I chose this graph because I was fascinated by how alcohol affects quality.
I was surprised when I saw how quality was scattered in the graph. I
limited the x-axis by taking out the top 1% of the values.
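Whether a wine sits above or below a regression line is the sign of its residual, so the visual reading above can be tabulated. The sketch below assumes Alcohol on the y-axis and Fixed Acidity on the x-axis (the report does not state the axes) and fits a simple linear model to the first ten wines from the `str()` output. The class letters are derived from those wines' quality scores using the report's scale.

```r
# First ten wines, copied from the str() output earlier in this report.
fixed.acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
alcohol       <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
# Classes for quality scores 5 5 5 6 5 5 5 7 7 5 (5 = F, 6 = E, 7 = D).
quality.class <- c("F", "F", "F", "E", "F", "F", "F", "D", "D", "F")

# Assumption: alcohol modelled as a linear function of fixed acidity.
fit <- lm(alcohol ~ fixed.acidity)

# A positive residual means the point lies above the fitted line.
side <- ifelse(resid(fit) > 0, "above", "below")

# Count wines above and below the line within each quality class.
counts <- table(quality.class, side)
counts
```

With only ten wines the table is not representative. It only shows how the above/below reading could be made reproducible on the full data.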
In this project I worked on the Red Wine Quality dataset.
The dataset has 1599 observations and 12 variables. I started by adding a
new variable for quality class (K to A), converted from the quality score (0-10).
Then I examined each variable to better understand the dataset.
There were some correlations between variables, some positive and some
negative. Some relations were obvious, such as Fixed Acidity and pH, and some
were surprising to me, such as Fixed Acidity and Alcohol.
One limitation of the dataset is the number of observations.
More observations would allow a better understanding and exploration of the
dataset. To explore the dataset further, I would try to find the relationship
between all features and the quality feature in order to predict the quality
of a specific wine.